Collaborators

Include the names of other students you worked with here.

Overview

You will work with three data sets in this assignment. Two of them are explored in both R and Python, which allows you to directly compare figures between the two languages. Two of the data sets are new, while the third comes from the previous assignment. You will practice lumping low-frequency counts in this assignment and consider the consequences of such actions. You will also practice visualizing continuous distributions using synthetic and real data.

The three data sets for this assignment are:

You will explore the WPIAL and synthetic data in both R and Python. The EPA AQS data will only be explored in R.

The last problem, Problem 06, requires you to read in potential data for your final project. You may answer this problem in either Python or R.

Important: Please download all CSV files from the Canvas site and save them in the same directory as your RMarkdown file. The code chunks that read in the CSV files assume they are in the same directory.

Problem 00

Load or import the tidyverse in the code chunk below.

SOLUTION

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Problem 01

The WPIAL is the high school athletic league in Western Pennsylvania. Each year, high schools across Western PA compete to win the Western PA title and move on to the state championship. For baseball, the WPIAL is divided into multiple classes based on high school district size. Schools with fewer students compete in Class 1A while the largest schools compete in Class 6A. The WPIAL Baseball Champions data are read in for you in the code chunk below. The data are assigned to the wpial object. There are three columns in the data set:

The data do not include all WPIAL winners. Although the first WPIAL baseball championship took place in 1914, you are working with data starting from 1979 because school districts were very different over 100 years ago.

wpial <- readr::read_csv('wpial_baseball_champs.csv', col_names = TRUE)
## Rows: 142 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Class, School
## dbl (1): Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1a)

Use glimpse() to check the number of rows, columns, and data types of the columns in wpial.

SOLUTION

Add your code chunk here.

wpial %>% glimpse()
## Rows: 142
## Columns: 3
## $ Year   <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2022, 2022, 2022, 2022, 202…
## $ Class  <chr> "6A", "5A", "4A", "3A", "2A", "1A", "6A", "5A", "4A", "3A", "2A…
## $ School <chr> "Mt. Lebanon", "Shaler", "Hopewell", "Riverside", "Seton LaSall…

1b)

Create a bar chart to show the counts for the Class variable.

SOLUTION

Add your code chunk here.

wpial %>% 
  ggplot(mapping = aes(x = Class)) +
  geom_bar() +
  theme_bw()

1c)

Which Class has the most rows? Class 2A.

Which Class has the fewest rows? Classes 5A and 6A.

1d)

Create a bar chart to show the counts for the School variable. The names should be relatively easy to read.

SOLUTION

wpial %>% 
  ggplot(mapping = aes(y = School)) +
  geom_bar() +
  theme_bw()

1e)

Create a bar chart to show the counts for the School variable again. However, this time the bar chart MUST be in ASCENDING frequency order. The names should be relatively easy to read.

SOLUTION

Add your code chunks here.

wpial %>% 
  mutate(School = fct_rev(fct_infreq(School))) %>% 
  ggplot(mapping = aes(y = School)) +
  geom_bar() +
  theme_bw()

1f)

Let’s now consider the combination of School and Class.

Create a STACKED bar chart to show the counts of the School variable. The bar fill must be associated with the Class variable. The bar chart must be in ASCENDING frequency order. The names should be relatively easy to read.

Use a colorblind-friendly color palette.

SOLUTION

Add your code chunks here.

wpial %>% 
  mutate(School = fct_rev(fct_infreq(School))) %>% 
  ggplot(mapping = aes(y = School)) +
  geom_bar(mapping = aes(fill = Class)) +
  ggthemes::scale_fill_colorblind() +
  theme_bw()

1g)

Let’s now consider the combinations of School and Class using a heat map.

Create a heat map to show the combinations of School and Class. The heat map fill must be associated with the count using a sequential color scale. The count must also be annotated via text and the text color must depend on the count.

SOLUTION

Add your code chunks here.

wpial %>% 
  count(School, Class) %>% 
  ggplot(mapping = aes(x = Class, y = School)) +
  geom_tile(mapping = aes(fill = n), color = 'black') +
  geom_text(mapping = aes(label = n, color = n > 3)) +
  scale_fill_viridis_c() +
  scale_color_manual(guide = 'none',
                     values = c('TRUE' = 'black', 'FALSE' = 'white')) +
  theme_bw()

1h)

How many different Class championships have been won by the School with the most total WPIAL championships?

SOLUTION

North Allegheny has won the most total WPIAL championships in this data set. Counting the distinct Class values for that school gives the number of different Class championships it has won.

wpial %>%
  count(School) %>%
  filter(n == max(n)) %>%
  pull(School)
## [1] "North Allegheny"

wpial %>%
  filter(School == 'North Allegheny') %>%
  distinct(Class) %>%
  nrow()

1i)

How many schools have won a single WPIAL championship?

SOLUTION

Add your code chunks here if you want to.

wpial %>%
  count(School) %>%
  filter(n == 1) %>%
  nrow()

Counting by School alone matters here: counting by School and Class would list a school that won single titles in two different classes multiple times, mistakenly treating it as a one-time winner.

Problem 02

Hopefully you have seen that there are many different School categories in the data set! We learned that LUMPING low-frequency categories can help simplify figures when there are a large number of categories. However, it is critical to consider the consequences of such actions! Let’s see what those consequences could be with the WPIAL data set.

2a)

There are multiple lumping procedures that you could consider, but let’s use a simple one in this assignment. You will lump based on a PROPORTION threshold. All LEVELS or CATEGORIES with LESS THAN a THRESHOLD proportion will be LUMPED into an OTHER category. In lecture, we used a proportion threshold of 5%. However, you will use a proportion threshold of 1% in this assignment.
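Before committing to a threshold, it can help to see which categories would actually fall below it. A minimal sketch (assuming the wpial data read in above; the 0.01 value is the assignment's threshold):

```r
# compute each School's proportion of all championships,
# then keep only those below the 1% lumping threshold
wpial %>% 
  count(School) %>% 
  mutate(prop = n / sum(n)) %>% 
  filter(prop < 0.01) %>% 
  arrange(desc(prop))
```

Every school returned by this chunk would be collapsed into the OTHER category by the lumping procedure below.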

Lump the low frequency categories of School into an OTHER category. Name the OTHER category '(OTHER)'. Visualize the counts of the remaining categories via a bar chart. The bar chart must be in ASCENDING frequency order. The names should be relatively easy to read.

SOLUTION

Add your code chunks here.

wpial %>% 
  mutate(School = fct_lump_prop(School, prop = 0.01, other_level = '(OTHER)')) %>% 
  mutate(School = fct_rev(fct_infreq(School))) %>% 
  ggplot(mapping = aes(y = School)) +
  geom_bar() +
  theme_bw()

wpial %>% 
  mutate(School_lump = fct_lump_prop(School, prop = 0.01, other_level = '(OTHER)'),
         School_factor = as.factor(School)) %>% 
  summary()
##       Year         Class              School                   School_lump
##  Min.   :1979   Length:142         Length:142         (OTHER)        :31  
##  1st Qu.:1996   Class :character   Class :character   North Allegheny: 8  
##  Median :2008   Mode  :character   Mode  :character   California     : 6  
##  Mean   :2006                                         Neshannock     : 6  
##  3rd Qu.:2017                                         Pine-Richland  : 6  
##  Max.   :2023                                         Riverside      : 6  
##                                                       (Other)        :79  
##          School_factor
##  North Allegheny:  8  
##  California     :  6  
##  Neshannock     :  6  
##  Pine-Richland  :  6  
##  Riverside      :  6  
##  Hopewell       :  5  
##  (Other)        :105

2b)

You will now visualize the combinations of the lumped School with Class via a heat map.

Lump the low frequency categories of School into an OTHER category. Name the OTHER category '(OTHER)'. Visualize the counts of the combinations of the remaining School categories and Class via a heat map. The heat map fill must be associated with the count using a sequential color scale. The count must also be annotated via text and the text color must depend on the count.

You must continue to use a PROPORTION threshold of 1%.

SOLUTION

Add your code chunks here.

wpial %>% 
  mutate(School = fct_lump_prop(School, prop = 0.01, other_level = '(OTHER)')) %>% 
  count(Class, School) %>% 
  ggplot(mapping = aes(x = Class, y = School)) +
  geom_tile(mapping = aes(fill = n), color = 'black') +
  geom_text(mapping = aes(label = n, color = n > 10)) +
  scale_fill_viridis_c() +
  scale_color_manual(guide = 'none',
                     values = c('TRUE' = 'black', 'FALSE' = 'white')) +
  theme_bw()

2c)

How would you describe the impact of the LUMPING procedure? Did it help? Did it modify the organizational nature of the School categories?

SOLUTION

The lumping procedure grouped the low-frequency categories of the School variable into a single OTHER category. This helps by reducing the number of distinct categories, simplifying the figure, and de-emphasizing rare categories that provide little insight individually. However, because so many schools won only a single championship, the OTHER category becomes by far the largest category, which changes the visual emphasis of the figure.

2d)

You executed the lumping within this assignment based on a proportion threshold dictated to you. In practice, you should always consider the category proportions BEFORE using an assumed threshold.

Visualize the proportion bar chart for School without lumping. The bar chart must be in ASCENDING order. The names should be relatively easy to read.

SOLUTION

wpial %>% 
  mutate(School = fct_rev(fct_infreq(School))) %>% 
  ggplot(mapping = aes(y = School)) +
  geom_bar(mapping = aes(x = after_stat(count / sum(count)))) +
  labs(x = 'proportion') +
  theme_bw()

2e)

Why does the figure in 2d) explain what happened with the LUMPING procedure?

SOLUTION

A single championship corresponds to a proportion of 1 / 142, or roughly 0.7%, which falls below the 1% threshold. The figure in 2d) shows that most schools sit at exactly this proportion, so the lumping procedure collapsed every one-time winner into the OTHER category.

Problem 03

You will now explore continuous variable distributions. You will do so using synthetic data generated for this assignment. The synthetic data are read in for you in the code chunk below and assigned to the df object. There are 9 continuous variables in the data set. These 9 variables were randomly generated using distributions that you will commonly encounter in data analysis tasks. Thus, exploring these 9 variables will prepare you for common challenges you will experience when exploring continuous variables.

df <- readr::read_csv('hw03_synthetic.csv', col_names = TRUE)
## Rows: 375 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): x1, x2, x3, x4, x5, x6, x7, x8, x9
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This problem is more open-ended than the others. You must create the following visualizations for EACH variable within df.

df %>%  glimpse()
## Rows: 375
## Columns: 9
## $ x1 <dbl> 8.120639, 10.550930, 7.493114, 14.785842, 10.988523, 7.538595, 11.4…
## $ x2 <dbl> 9.960223, 14.826853, 10.615168, 5.542776, 9.304117, 6.307572, 9.218…
## $ x3 <dbl> 9.047559, 13.590404, 27.412923, 13.446587, 15.069754, 8.629539, 11.…
## $ x4 <dbl> 1.505284, 2.376810, 1.754423, 7.392501, 1.702202, 0.974089, 20.4096…
## $ x5 <dbl> 2.743911439, 0.517339097, 1.352082323, 1.390951409, 0.693554528, 1.…
## $ x6 <dbl> 2.4843101, 3.0625250, 0.4382848, 1.1873050, 1.3511870, 1.2452908, 0…
## $ x7 <dbl> 0.8941811, 0.8911175, 0.7456002, 0.9214166, 0.9253575, 0.8183267, 0…
## $ x8 <dbl> 2.3458047, 0.4173961, 0.6072066, 1.9577240, -2.5307052, -1.7593410,…
## $ x9 <dbl> 0.08401409, 0.96465840, 0.09453490, 0.09908694, 0.82311863, 0.89395…
df %>% ls()
## [1] "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9"
df %>% 
  ggplot(mapping = aes(x = x1)) +
  geom_histogram(bins = 25) +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x2)) +
  geom_histogram(bins = 25) +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x3)) +
  geom_histogram(bins = 25) +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x4)) +
  geom_histogram(bins = 25) +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x5)) +
  geom_histogram(bins = 25) +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x6)) +
  geom_histogram(bins = 25) +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x7)) +
  geom_histogram(bins = 25) +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x8)) +
  geom_histogram(bins = 25) +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x9)) +
  geom_histogram(bins = 25) +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x1)) +
  geom_histogram(bins = 25,
                 fill = 'dodgerblue',
                 mapping = aes(y = after_stat(density))) +
  geom_density(linewidth = 1.2, color = 'black') +
  geom_rug() +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x2)) +
  geom_histogram(bins = 25,
                 fill = 'dodgerblue',
                 mapping = aes(y = after_stat(density))) +
  geom_density(linewidth = 1.2, color = 'black') +
  geom_rug() +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x3)) +
  geom_histogram(bins = 25,
                 fill = 'dodgerblue',
                 mapping = aes(y = after_stat(density))) +
  geom_density(linewidth = 1.2, color = 'black') +
  geom_rug() +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x4)) +
  geom_histogram(bins = 25,
                 fill = 'dodgerblue',
                 mapping = aes(y = after_stat(density))) +
  geom_density(linewidth = 1.2, color = 'black') +
  geom_rug() +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x5)) +
  geom_histogram(bins = 25,
                 fill = 'dodgerblue',
                 mapping = aes(y = after_stat(density))) +
  geom_density(linewidth = 1.2, color = 'black') +
  geom_rug() +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x6)) +
  geom_histogram(bins = 25,
                 fill = 'dodgerblue',
                 mapping = aes(y = after_stat(density))) +
  geom_density(linewidth = 1.2, color = 'black') +
  geom_rug() +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x7)) +
  geom_histogram(bins = 25,
                 fill = 'dodgerblue',
                 mapping = aes(y = after_stat(density))) +
  geom_density(linewidth = 1.2, color = 'black') +
  geom_rug() +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x8)) +
  geom_histogram(bins = 25,
                 fill = 'dodgerblue',
                 mapping = aes(y = after_stat(density))) +
  geom_density(linewidth = 1.2, color = 'black') +
  geom_rug() +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x9)) +
  geom_histogram(bins = 25,
                 fill = 'dodgerblue',
                 mapping = aes(y = after_stat(density))) +
  geom_density(linewidth = 1.2, color = 'black') +
  geom_rug() +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x1)) +

  geom_vline(xintercept = quantile( df$x1 ),
             color = 'blue', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x2)) +

  geom_vline(xintercept = quantile( df$x2 ),
             color = 'blue', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x3)) +

  geom_vline(xintercept = quantile( df$x3 ),
             color = 'blue', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x4)) +

  geom_vline(xintercept = quantile( df$x4 ),
             color = 'blue', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x5)) +

  geom_vline(xintercept = quantile( df$x5 ),
             color = 'blue', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x6)) +

  geom_vline(xintercept = quantile( df$x6 ),
             color = 'blue', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x7)) +

  geom_vline(xintercept = quantile( df$x7 ),
             color = 'blue', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x8)) +

  geom_vline(xintercept = quantile( df$x8 ),
             color = 'blue', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x9)) +

  geom_vline(xintercept = quantile( df$x9 ),
             color = 'blue', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x1)) +
  geom_hline(yintercept = seq(0, 1, by = 0.25),
             color = 'red', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x2)) +
  geom_hline(yintercept = seq(0, 1, by = 0.25),
             color = 'red', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x3)) +
  geom_hline(yintercept = seq(0, 1, by = 0.25),
             color = 'red', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x4)) +
  geom_hline(yintercept = seq(0, 1, by = 0.25),
             color = 'red', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x5)) +
  geom_hline(yintercept = seq(0, 1, by = 0.25),
             color = 'red', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x6)) +
  geom_hline(yintercept = seq(0, 1, by = 0.25),
             color = 'red', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x7)) +
  geom_hline(yintercept = seq(0, 1, by = 0.25),
             color = 'red', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x8)) +
  geom_hline(yintercept = seq(0, 1, by = 0.25),
             color = 'red', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

df %>% 
  ggplot(mapping = aes(x = x9)) +
  geom_hline(yintercept = seq(0, 1, by = 0.25),
             color = 'red', linetype = 'dashed', linewidth = 1.) +
  stat_ecdf(linewidth = 2) +
  labs(y = 'cumulative proportion') +
  theme_bw()

You must insert all code chunks. Your solutions must be well organized and easy to follow, but you may organize the code however you see fit. You CANNOT use any packages outside of the tidyverse to create the figures.

You must also answer the following conceptual questions for each variable:

You may use for-loops to aid in the creation of the figures, but that is not required. I would not use for-loops if I were in your shoes! However, it should be OBVIOUS which variable you are visualizing and discussing in your response.
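One loop-free way to organize the nine histograms (a sketch, not a required approach): pivot the data to long format and facet by variable name. The same pattern works for the density and ECDF versions by swapping the geoms.

```r
# stack all nine columns into (variable, value) pairs,
# then draw one histogram per variable with free scales
df %>% 
  pivot_longer(cols = everything(),
               names_to = 'variable',
               values_to = 'value') %>% 
  ggplot(mapping = aes(x = value)) +
  geom_histogram(bins = 25) +
  facet_wrap(~ variable, scales = 'free') +
  theme_bw()
```

This keeps each variable clearly labeled by its facet title while avoiding nine nearly identical code chunks.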

HINT: The Week 03 readings will greatly help with this problem.

SOLUTION

Add your code chunks here.

Problem 04

You explored the counts of categorical variables for the EPA AQS data in the previous assignment. That data set included a continuous variable, daily_avg, which was not explored. You will explore that variable in this assignment! You will only explore the EPA AQS daily_avg variable in R.

The code chunk below reads in the EPA AQS data and assigns the data to the epa object. The file path includes '../Week_02', where '..' represents moving up one directory before entering the 'Week_02' subdirectory. Thus, the code chunk below assumes this RMarkdown file is saved in a sibling subdirectory (such as 'Week_03') next to 'Week_02'.

epa <- readr::read_csv('../Week_02/epa_airdata_small.csv', col_names = TRUE)
## Rows: 34112 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): state, county, city, site, pollutant, units_of_measure
## dbl  (1): daily_avg
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
epa %>% glimpse()
## Rows: 34,112
## Columns: 8
## $ state            <chr> "Pennsylvania", "Pennsylvania", "Pennsylvania", "Penn…
## $ county           <chr> "Allegheny", "Allegheny", "Allegheny", "Allegheny", "…
## $ city             <chr> "Wilkinsburg", "Wilkinsburg", "Wilkinsburg", "Wilkins…
## $ site             <chr> "1376", "1376", "1376", "1376", "1376", "1376", "1376…
## $ pollutant        <chr> "Carbon monoxide", "Carbon monoxide", "Carbon monoxid…
## $ date             <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019…
## $ daily_avg        <dbl> 0.391957, 0.400957, 0.341391, 0.744391, 0.464826, 0.4…
## $ units_of_measure <chr> "Parts per million", "Parts per million", "Parts per …

4a)

Create the MARGINAL histogram for daily_avg using an appropriate number of bins based on the sample size.

SOLUTION

Add your code chunks here.

epa %>% 
  ggplot(mapping = aes(x = daily_avg)) +
  geom_histogram(bins = 17) +
  theme_bw()

4b)

Although the previous figure seemed like a good place to start, the figure is incorrect. The previous histogram makes no sense for this application! Your categorical variable exploration in the previous assignment demonstrated why the histogram in 4a) makes no sense. However, let’s find out by creating several conditional histograms in this assignment.

Create the histogram of daily_avg again but this time FACET based on the pollutant variable. The scales should be free (or allowed to float) across the wrapped facets.

SOLUTION

Add your code chunks here.

epa %>% 
  ggplot(mapping = aes(x = daily_avg)) +
  geom_histogram(bins = 11) +
  facet_wrap( ~pollutant, scales = 'free' ) +
  theme_bw()

4c)

Facets can be associated with more than one variable. Adding additional variables to the FACETS is simple using the facet formula interface. The syntax below shows how facets can be associated with TWO variables within the facet_wrap() function:

facet_wrap( ~ <variable 1> + <variable 2> )

The <variable 1> and <variable 2> names serve as placeholders for the facetting variables of interest. The + reads as “facet based on <variable 1> AND <variable 2>”.

Create the histogram of daily_avg again but this time FACET based on two variables. The facets must be associated with the pollutant variable AND the units_of_measure variable. The scales should be free (or allowed to float) across the wrapped facets.

SOLUTION

epa %>% 
  ggplot(mapping = aes(x = daily_avg)) +
  geom_histogram(bins = 11) +
  facet_wrap( ~pollutant + units_of_measure , scales = 'free' ) +
  theme_bw()


4d)

Now that you have visualized the distribution of daily_avg conditioned on two important categorical variables, let’s revisit why the histogram in 4a) makes no sense.

Why was the histogram in 4a) a poor figure to create for this application?

SOLUTION

Because the marginal histogram pooled every pollutant together, even though daily_avg is recorded on very different scales and in different units depending on the pollutant. Binning parts-per-million values together with values measured in other units produces an x-axis with no consistent physical meaning, so the pooled distribution is uninterpretable.

Problem 05

The EPA AQS data consists of measurements recorded at multiple sites across cities and counties from Western PA. Thus, it would be particularly interesting to consider the conditional distribution of the daily_avg variable GIVEN the site! This allows you to explore if the daily_avg distributions are different at different sites!

5a)

There are multiple approaches to visualize conditional distributions. Let’s start with the box plot within this assignment.

Create a boxplot for the daily_avg conditioned on the city, pollutant, and units_of_measure. The daily_avg variable must be mapped to the x aesthetic, the city must be mapped to the y aesthetic within the PARENT ggplot() call. The facets must be associated with pollutant and units_of_measure. You are therefore creating facetted HORIZONTAL boxplots.

SOLUTION

epa %>% 
  ggplot(mapping = aes(x = daily_avg, y = city)) +
  geom_boxplot(fill = 'pink', color = 'purple') +
  facet_wrap( ~pollutant + units_of_measure , scales = 'free' ) +
  theme_bw()

5b)

Boxplots include additional aesthetics that can be associated with categorical variables. The boxplot fill and color can be used to represent a variable different from the variable associated with the main aesthetic (the x aesthetic for vertical boxplots and the y aesthetic for horizontal boxplots). This allows you to group by an additional variable. The boxplot is therefore summarizing the distribution GIVEN the combination of the main aesthetic and the fill aesthetic.

Create a boxplot for the daily_avg conditioned on the city, county, pollutant, and units_of_measure. The daily_avg variable must be mapped to the x aesthetic, the city must be mapped to the y aesthetic within the PARENT ggplot() call. You must map the county variable to the fill and color aesthetics within the boxplot layer. Hard code the boxplot alpha to be 0.33. The facets must be associated with pollutant and units_of_measure. You are therefore creating facetted HORIZONTAL boxplots.

You may use the default fill and color palettes.

SOLUTION

epa %>% 
  ggplot(mapping = aes(x = daily_avg, y = city)) +
    geom_boxplot(aes(fill = county, color = county), alpha = 0.33) +
    facet_grid(pollutant ~ units_of_measure, scales = "free") +
    theme_bw()

epa %>% 
  ggplot(mapping = aes(x =daily_avg  , y = city)) +
  geom_boxplot(aes(fill = county, color = county), alpha = 0.33) +
  facet_wrap( ~pollutant + units_of_measure , scales = 'free' ) +
  theme_bw()

5c)

What changed about the boxplots when you included the additional county grouping?

SOLUTION

With the additional county grouping, each boxplot is differentiated by both city and county: within each city, counties are represented by different fills and colors, producing one box per city and county combination. This provides a more detailed comparison of daily_avg and highlights differences between counties that would otherwise be hidden inside a single box per city.

Problem 06

Now it’s time to start exploring a data set of your own choosing! Last assignment you considered several potential data sets for your final project. You may use one of those here or you can use a different data set for this problem. It is up to you!

Regardless of your choice of data set, you must perform the following actions:

library(tidytuesdayR)
big_tech_stock_prices <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-07/big_tech_stock_prices.csv')
## Rows: 45088 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): stock_symbol
## dbl  (6): open, high, low, close, adj_close, volume
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
big_tech_stock_prices %>% glimpse()
## Rows: 45,088
## Columns: 8
## $ stock_symbol <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "…
## $ date         <date> 2010-01-04, 2010-01-05, 2010-01-06, 2010-01-07, 2010-01-…
## $ open         <dbl> 7.622500, 7.664286, 7.656429, 7.562500, 7.510714, 7.60000…
## $ high         <dbl> 7.660714, 7.699643, 7.686786, 7.571429, 7.571429, 7.60714…
## $ low          <dbl> 7.585000, 7.616071, 7.526786, 7.466071, 7.466429, 7.44464…
## $ close        <dbl> 7.643214, 7.656429, 7.534643, 7.520714, 7.570714, 7.50392…
## $ adj_close    <dbl> 6.515213, 6.526476, 6.422664, 6.410790, 6.453412, 6.39648…
## $ volume       <dbl> 493729600, 601904800, 552160000, 477131200, 447610800, 46…
is_numeric <- sapply(big_tech_stock_prices, is.numeric)
print(is_numeric)
## stock_symbol         date         open         high          low        close 
##        FALSE        FALSE         TRUE         TRUE         TRUE         TRUE 
##    adj_close       volume 
##         TRUE         TRUE
str(big_tech_stock_prices)
## spc_tbl_ [45,088 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ stock_symbol: chr [1:45088] "AAPL" "AAPL" "AAPL" "AAPL" ...
##  $ date        : Date[1:45088], format: "2010-01-04" "2010-01-05" ...
##  $ open        : num [1:45088] 7.62 7.66 7.66 7.56 7.51 ...
##  $ high        : num [1:45088] 7.66 7.7 7.69 7.57 7.57 ...
##  $ low         : num [1:45088] 7.58 7.62 7.53 7.47 7.47 ...
##  $ close       : num [1:45088] 7.64 7.66 7.53 7.52 7.57 ...
##  $ adj_close   : num [1:45088] 6.52 6.53 6.42 6.41 6.45 ...
##  $ volume      : num [1:45088] 4.94e+08 6.02e+08 5.52e+08 4.77e+08 4.48e+08 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   stock_symbol = col_character(),
##   ..   date = col_date(format = ""),
##   ..   open = col_double(),
##   ..   high = col_double(),
##   ..   low = col_double(),
##   ..   close = col_double(),
##   ..   adj_close = col_double(),
##   ..   volume = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
unique_counts <- sapply(big_tech_stock_prices, function(x) {
  if (is.factor(x)) {
    nlevels(x)
  } else {
    length(unique(x))
  }
})

print(unique_counts)
## stock_symbol         date         open         high          low        close 
##           14         3287        29798        30056        30028        30965 
##    adj_close       volume 
##        41279        43550
missing_counts <- colSums(is.na(big_tech_stock_prices))
print(missing_counts)
## stock_symbol         date         open         high          low        close 
##            0            0            0            0            0            0 
##    adj_close       volume 
##            0            0
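The base-R sapply() checks above can also be written in tidyverse style with across(); a sketch producing one-row summaries per column (assuming the big_tech_stock_prices data read in above):

```r
# number of unique values in each column
big_tech_stock_prices %>% 
  summarise(across(everything(), n_distinct))

# number of missing values in each column
big_tech_stock_prices %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))
```

Both forms report the same information; across() simply keeps the checks inside the same pipe-based style used throughout the assignment.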

You must create figures to start exploring the data. However, you do not need to explore ALL variables in this assignment. This is just to get you started!

The requirements depend on the variables in your data. Please see the bullets below for what is required.

big_tech_stock_prices %>%
  ggplot(mapping = aes(x = stock_symbol)) +
  geom_bar(color = 'green', fill = 'pink') +
  geom_text(stat = 'count', mapping = aes(label = after_stat(count)),
            vjust = -0.5, color = 'black') +
  theme_bw()

Note that piping through count(stock_symbol) before geom_bar() would collapse the data to one row per symbol, making every bar height 1; geom_bar() on the raw data performs the counting itself, and after_stat(count) replaces the deprecated ..count.. notation.

big_tech_stock_prices %>% 
  ggplot(mapping = aes(x = open)) +
  geom_histogram(bins = 51, color = 'purple', fill = 'pink',
                 linewidth = 0.55,
                 alpha = 0.33) +
  theme_bw()

big_tech_stock_prices %>% 
  ggplot(mapping = aes(x = high)) +
  geom_histogram(bins = 51, color = 'purple', fill = 'pink',
                 linewidth = 0.55,
                 alpha = 0.33) +
  theme_bw()


big_tech_stock_prices %>% 
  ggplot(mapping = aes(x = low)) +
  geom_histogram(bins = 51, color = 'purple', fill = 'pink',
                 linewidth = 0.55,
                 alpha = 0.33) +
  theme_bw()

big_tech_stock_prices %>% 
  ggplot(mapping = aes(x = close)) +
  geom_histogram(bins = 51, color = 'purple', fill = 'pink',
                 linewidth = 0.55,
                 alpha = 0.33) +
  theme_bw()

big_tech_stock_prices %>% 
  ggplot(mapping = aes(x = adj_close)) +
  geom_histogram(bins = 51, color = 'purple', fill = 'pink',
                 linewidth = 0.55,
                 alpha = 0.33) +
  theme_bw()

big_tech_stock_prices %>% 
  ggplot(mapping = aes(x = volume)) +
  geom_histogram(bins = 51, color = 'purple', fill = 'pink',
                 linewidth = 0.55,
                 alpha = 0.33) +
  theme_bw()

The number of bins in the histograms should be appropriate based on the sample size.

If you are having issues reading the data into R please contact Dr. Yurko right away!!!!!!